THE OFFICE - TEXT ANALYSIS

This project’s main purpose is to analyze a TV show in a reliable and measurable way, without the need to watch the whole show or rely on a personal perspective. The selected subject for this analysis is the sitcom ‘The Office’, which was selected mainly for the high availability of data.

This notebook uses the previously collected and cleaned data to go through the analysis process.

Prepare Environment¶

The non-standard libraries used in this notebook are:

  • Pandas for data wrangling;
  • NumPy, scikit-learn, SciPy, and spaCy for mathematical, statistical, and machine-learning tasks;
  • NetworkX, Matplotlib, and Seaborn for visualizations.

Main dataset, first 5 rows:
Out[5]:
id text name episode_name episode_number season ep_seas clean_txt sentences words sentences_qty words_qty negative neutral positive compound
0 2 All right Jim. Your quarterlies look very goo... Michael PILOT 01 1 01-01 right jim quarterlies look good things library [' All right Jim.', 'Your quarterlies look ver... ['all', 'right', 'jim', 'your', 'quarterlies',... 3 14 0.0 0.803 0.197 0.4927
1 3 Oh, I told you. I couldn't close it. So... Jim PILOT 01 1 01-01 oh told couldnt close [' Oh, I told you.', "I couldn't close it.", '... ['oh', 'i', 'told', 'you', 'i', 'couldnt', 'cl... 3 9 0.0 1.000 0.000 0.0000
2 4 So you've come to the master for guidance? Is... Michael PILOT 01 1 01-01 youve come master guidance youre saying grassh... [" So you've come to the master for guidance?"... ['so', 'youve', 'come', 'to', 'the', 'master',... 2 14 0.0 1.000 0.000 0.0000
3 5 Actually, you called me in here, but yeah. Jim PILOT 01 1 01-01 actually called yeah [' Actually, you called me in here, but yeah.'] ['actually', 'you', 'called', 'me', 'in', 'her... 1 8 0.0 0.714 0.286 0.4215
4 6 All right. Well, let me show you how it's done. Michael PILOT 01 1 01-01 right well let show done [' All right.', "Well, let me show you how it'... ['all', 'right', 'well', 'let', 'me', 'show', ... 2 10 0.0 0.811 0.189 0.2732

Main Characters¶

The first step in analyzing the show is to define who the main characters are.

This may have several different interpretations, but for this project we use a few metrics to decide. The points to be considered are:

  • The total number of dialogs a character had;
  • The total number of episodes the character had a dialog in;
  • The number of seasons the character appeared in.

The number of dialogs and episodes is the main indicator of who the main characters are: characters who have the biggest share of the dialog and appear in most of the episodes receive more attention and therefore should be the main characters.

There's a challenge in this part because of special guests and characters who were very important for a short period of time. Those characters appear to have lots of dialogs and participate in lots of episodes, but they're only around for a couple of seasons at most. This is why we also consider the number of seasons when calculating the main-character score.

To solve this issue I developed a score that considers all the above-mentioned approaches.

Group by character¶

We start by aggregating the numerical fields to get their descriptive statistics, such as means, standard deviations, medians, and other aggregations.
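As a rough illustration of this step, here is a dependency-free sketch that groups rows by character and summarizes one numeric field. The function name `describe_by_character` is hypothetical, and the sample rows only reuse the `name` and `words_qty` values of the five rows shown above. The notebook itself most likely relies on a pandas group-by, so treat this as an assumption about the logic, not the actual implementation.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# The `name` and `words_qty` values of the five dataset rows shown earlier.
rows = [
    {"name": "Michael", "words_qty": 14},
    {"name": "Jim", "words_qty": 9},
    {"name": "Michael", "words_qty": 14},
    {"name": "Jim", "words_qty": 8},
    {"name": "Michael", "words_qty": 10},
]

def describe_by_character(rows, field):
    """Group rows by character name and summarize one numeric field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["name"]].append(row[field])
    return {
        name: {
            "dialogs": len(values),   # one row per dialog
            "total": sum(values),
            "mean": mean(values),
            "median": median(values),
            # Sample standard deviation needs at least two values.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
        for name, values in groups.items()
    }

summary = describe_by_character(rows, "words_qty")
for name, stats in summary.items():
    print(name, stats)
```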

Score¶

This is the indicator I developed to help find the main characters of the series and rank them by relevance to the show.

Score = nep + (nd / nep) * (ns/5)

nep = number of episodes;
nd = number of dialogs;
ns = number of seasons;

5 is a threshold I used.
The idea is to "penalize" characters that appeared in fewer than 5 seasons (approx. half the series) and give more weight to characters that appeared in more than 5 seasons.
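The formula can be written as a small function. This is a minimal sketch: the function and parameter names are my own, and the inputs are the episode, dialog, and season counts reported for Michael and Dwight in the results table.

```python
def main_character_score(n_episodes, n_dialogs, n_seasons, season_threshold=5):
    """Score = nep + (nd / nep) * (ns / threshold)."""
    if n_episodes <= 0:
        raise ValueError("a character needs at least one episode")
    dialogs_per_episode = n_dialogs / n_episodes
    season_weight = n_seasons / season_threshold  # below 1 penalizes, above 1 rewards
    return n_episodes + dialogs_per_episode * season_weight

# Counts taken from the results table (unique_ep, dialogs, unique_s).
michael = main_character_score(139, 10960, 8)
dwight = main_character_score(188, 6852, 9)
print(michael, dwight)
```

Note how Michael ranks first despite appearing in fewer episodes than Dwight: his much higher number of dialogs per episode outweighs the difference.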

Out[8]:
chars dialogs avg_words std_words 25%_median_words 50%_median_words 75%_median_words avg_sentences std_sentences positive neutral negative compound total_words unique_s unique_ep score
184 Michael 10960.0 13.585675 16.290656 4.0 8.0 17.0 2.343704 2.162617 0.193529 0.731349 0.075123 0.146149 148899 8 139 265.158273
81 Dwight 6852.0 11.050058 12.467607 3.0 7.0 14.0 1.972709 1.624145 0.155236 0.762325 0.082144 0.083521 75715 9 188 253.604255
131 Jim 6314.0 9.225055 10.859843 3.0 6.0 12.0 1.649034 1.195340 0.195979 0.738376 0.065648 0.135329 58247 9 187 247.776471
205 Pam 5035.0 8.962860 10.792735 3.0 6.0 11.0 1.598808 1.242186 0.198712 0.734246 0.067045 0.129266 45128 9 184 233.255435
153 Kevin 1567.0 8.002553 8.765353 2.0 5.0 10.0 1.560306 1.068548 0.179865 0.744870 0.075264 0.094385 12540 9 182 197.497802
12 Angela 1564.0 8.598465 9.622824 3.0 6.0 11.0 1.615729 1.263710 0.153496 0.739256 0.106611 0.053237 13448 9 173 189.272832
10 Andy 3780.0 11.747884 12.704352 3.0 8.0 16.0 1.981746 1.574955 0.176831 0.752245 0.070660 0.135988 44407 7 145 181.496552
204 Oscar 1366.0 8.791362 9.039496 3.0 6.0 12.0 1.560761 1.063706 0.152761 0.784579 0.062661 0.080598 12009 9 166 180.812048
216 Phyllis 976.0 7.695697 7.513722 3.0 6.0 10.0 1.376025 0.721110 0.159429 0.768445 0.072122 0.085728 7511 9 168 178.457143
251 Stanley 686.0 8.651603 8.996003 3.0 6.0 11.0 1.447522 0.960239 0.110771 0.801340 0.087892 0.026518 5935 9 168 175.350000
238 Ryan Howard 1212.0 9.922442 10.506602 3.0 6.0 13.0 1.623762 1.185668 0.172592 0.764349 0.063062 0.120825 12026 9 142 157.363380
148 Kelly 848.0 10.920991 11.498150 4.0 8.0 13.0 1.678066 1.195778 0.149959 0.771801 0.078242 0.094249 9261 9 144 154.600000
182 Meredith 562.0 7.798932 6.972096 3.0 6.0 11.0 1.626335 1.199770 0.164235 0.741536 0.094222 0.049518 4383 9 134 141.549254
58 Creed 408.0 9.593137 10.038312 3.0 6.0 11.0 1.784314 1.418567 0.142757 0.779772 0.075025 0.077308 3914 8 130 135.021538
67 Darryl 1234.0 9.528363 9.806295 3.0 6.0 12.0 1.719611 1.170392 0.161492 0.766948 0.071566 0.093145 11758 9 111 131.010811

Main characters according to the score are:

Out[9]:
['Michael',
 'Dwight',
 'Jim',
 'Pam',
 'Kevin',
 'Angela',
 'Andy',
 'Oscar',
 'Phyllis',
 'Stanley',
 'Ryan Howard',
 'Kelly',
 'Meredith',
 'Creed',
 'Darryl']
Out[12]:
chars unique_ep dialogs unique_s score
184 Michael 139 10960.0 8 265.158273
81 Dwight 188 6852.0 9 253.604255
131 Jim 187 6314.0 9 247.776471
205 Pam 184 5035.0 9 233.255435
153 Kevin 182 1567.0 9 197.497802
12 Angela 173 1564.0 9 189.272832
10 Andy 145 3780.0 7 181.496552
204 Oscar 166 1366.0 9 180.812048
216 Phyllis 168 976.0 9 178.457143
251 Stanley 168 686.0 9 175.350000
238 Ryan Howard 142 1212.0 9 157.363380
148 Kelly 144 848.0 9 154.600000
182 Meredith 134 562.0 9 141.549254
58 Creed 130 408.0 8 135.021538
67 Darryl 111 1234.0 9 131.010811

Episodes, Dialogs and Seasons¶

To test our score we can compare the distributions for our selected variables.

In the chart below the values are displayed as:

  • X-axis = Number of episodes;
  • Y-axis = Number of dialogs;
  • Size = Number of seasons (more seasons = bigger markers);

The chart compares the main characters (red) selected by the score with all the other characters (blue).

Total and Average Dialogs¶

We can also see how the score handles the 'Average dialogs'. In the chart below we have:

  • X-axis = Average Dialogs;
  • Y-axis = Total Dialogs;
  • Size = Number of Seasons;

Words and Sentences¶

A very interesting characteristic we can analyze is the number of words and sentences a character says. Characters with a high average of both are the ones who have lots to talk about: they don't just react to situations, they have something to add to them.

We can think of subjectivity as roughly the number of words said minus the words actually needed to convey the message. So using lots of words to pass a small message means there's lots of subjectivity.

Mean¶

In the chart displayed below, the blue bars represent the means and the black lines their standard deviations. The problem in this case is that the standard deviations are about as large as the means themselves. This suggests our data is heavily skewed by extreme outliers, so the averages are not a good indication of who talks more or less; they only give us a rough idea.
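To see why the mean is a weak summary for this kind of data, consider the small sketch below. The word counts are made up for illustration (mostly short lines plus one long speech) and are not taken from the dataset.

```python
from statistics import mean, median, stdev

# Hypothetical words-per-dialog counts: short lines plus one long monologue.
words_per_dialog = [2, 3, 3, 4, 5, 6, 6, 8, 11, 72]

avg = mean(words_per_dialog)      # pulled upward by the single long speech
spread = stdev(words_per_dialog)  # sample standard deviation
mid = median(words_per_dialog)    # unaffected by how long the monologue is

print(f"mean={avg:.2f} std={spread:.2f} median={mid}")
```

A single outlier is enough to push the standard deviation above the mean, while the median still describes a typical line.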

Median¶

Since we can't get the full picture from the means, we can analyse the medians for those characters.

Medians are more interesting: one character in particular differs from everyone else. Of all the main characters, Kevin has the lowest median number of words per dialog.

And we can easily find evidence of that.
https://www.youtube.com/watch?v=_K-L9uhsBLM

Sentiment analysis¶

To analyse the overall sentiment polarity of the show and its main characters we're using VADER. Please consult the data cleaning and preparation notebook for more information about this method and its implementation.

Visualize Polarities¶

The visualization of the results aims at displaying the characters of the show, and the average positive and negative sentiments for each of them.

For a fair comparison of those values we're plotting them on the same scale, where:

  • Range = [ min( positive, negative ), max( positive, negative ) ]

So the range of 0.04 to 0.23 is applied to both the x and y axes.

We can see that most characters behave similarly in terms of the polarity of their dialogs: the values concentrate at high positive and low negative for the vast majority of them. But we can also see some outliers away from the group.

Outliers¶

As mentioned before, most of the characters have a high positive score of around 0.14 to 0.20, with a low negative score of 0.06 to 0.08.
But we can note some characters with higher negative scores and also a character with a lower positive score.

Stanley is the most distant from the other characters: he has a relatively low positive score, but his negative score isn't so high either.

This means his dialogs are mostly neutral, almost like he doesn't want to get involved. https://www.youtube.com/watch?v=iahcJPo9Dwg

Out[18]:
chars dialogs avg_words positive neutral negative compound unique_s unique_ep score
182 Meredith 562.0 7.798932 0.164235 0.741536 0.094222 0.049518 9 134 141.549254
12 Angela 1564.0 8.598465 0.153496 0.739256 0.106611 0.053237 9 173 189.272832
251 Stanley 686.0 8.651603 0.110771 0.801340 0.087892 0.026518 9 168 175.350000

Relationships¶

The file 'conversations.json' contains one record for every scene in the show. Each record holds the names of the characters that had some dialog in the scene and the number of dialogs each of them had.

These conversations will be used to calculate a score for the relations between the characters.

first 5 rows:
Out[19]:
[{'Michael': 3, 'Jim': 2},
 {'Michael': 5, 'Pam': 4},
 {'Michael': 6, 'Jim': 3, 'Dwight': 2},
 {'Michael': 11, 'Pam': 2},
 {'Phyllis': 1, 'Stanley': 1}]

Scores¶

In order to compare the relationship between the characters the following formula was developed:

∑min(nx,ny)/max(nx,ny)

Where:

nx = number of dialogs character x had in a conversation;
ny = number of dialogs character y had in a conversation;


This score is based on the concept that a perfectly balanced conversation will have the same amount of dialogs between both agents.

E.g.: A conversation with three characters x, y and z;
Where x said 5 dialogs, y said 5 dialogs, and z said 1 dialog will result in a score between x and y of 1, while the score between x and z will be 0.2.
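The calculation can be sketched as follows. This is a minimal illustration assuming every pair of characters in a scene contributes one min/max ratio, and that the ratios are summed across scenes; the function name `relationship_scores` is hypothetical. The `sample` records are the five rows of 'conversations.json' shown above.

```python
from collections import defaultdict
from itertools import combinations

def relationship_scores(conversations):
    """Sum min(nx, ny) / max(nx, ny) over every scene a pair shares."""
    scores = defaultdict(float)
    for conversation in conversations:
        # Sort names so each pair always maps to the same key.
        for x, y in combinations(sorted(conversation), 2):
            nx, ny = conversation[x], conversation[y]
            # Counts are at least 1 for anyone listed in a scene.
            scores[(x, y)] += min(nx, ny) / max(nx, ny)
    return dict(scores)

# The worked example from the text: x and y are balanced, z barely speaks.
example = relationship_scores([{"x": 5, "y": 5, "z": 1}])

# The first five conversation records shown earlier.
sample = relationship_scores([
    {"Michael": 3, "Jim": 2},
    {"Michael": 5, "Pam": 4},
    {"Michael": 6, "Jim": 3, "Dwight": 2},
    {"Michael": 11, "Pam": 2},
    {"Phyllis": 1, "Stanley": 1},
])
print(example)
print(sample)
```

Because the ratios are summed, pairs that share many scenes accumulate higher scores even when individual conversations are unbalanced.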


The scores are then aggregated with all scores from the same relation so they can be compared. It's important to note that this will result in generally higher scores for characters that communicate a lot and lower scores for characters that don't.

After calculating the relationship scores for every character of the show we have the following data:

Out[22]:
Kevin Phyllis Meredith Angela Stanley Oscar Pam Jim Darryl Creed Ryan Howard Dwight Kelly Andy Michael
names
Kevin 0.000000 109.195238 68.806818 154.564286 65.969048 170.094913 147.983586 142.731019 66.178510 49.209524 44.191450 123.556432 58.044444 125.062612 108.404618
Phyllis 109.195238 0.000000 58.662338 103.914286 133.944048 107.298810 130.937843 118.793685 45.466667 41.485714 41.692857 142.265707 55.544444 105.159174 82.771664
Meredith 68.806818 58.662338 0.000000 59.121861 44.815909 69.755195 76.134199 53.527924 25.266667 32.631818 22.500000 69.378304 39.252020 59.419120 51.747815
Angela 154.564286 103.914286 59.121861 0.000000 56.385714 159.026190 124.944444 70.692491 22.976190 32.319048 25.576190 196.329949 60.100000 84.234423 78.417826
Stanley 65.969048 133.944048 44.815909 56.385714 0.000000 71.201190 77.652092 84.074060 22.366667 34.250000 38.444048 93.269719 31.594444 76.107937 77.463557
Oscar 170.094913 107.298810 69.755195 159.026190 71.201190 0.000000 127.759679 107.622387 52.383333 48.391667 46.430357 117.605836 46.766703 96.470033 99.879648
Pam 147.983586 130.937843 76.134199 124.944444 77.652092 127.759679 0.000000 587.992857 52.899206 42.349639 81.944931 238.176138 75.771176 130.851199 301.972284
Jim 142.731019 118.793685 53.527924 70.692491 84.074060 107.622387 587.992857 0.000000 58.863492 46.019208 87.016138 452.530350 63.588468 183.466818 293.039036
Darryl 66.178510 45.466667 25.266667 22.976190 22.366667 52.383333 52.899206 58.863492 0.000000 11.342857 20.764286 60.986722 24.885714 85.220202 69.502232
Creed 49.209524 41.485714 32.631818 32.319048 34.250000 48.391667 42.349639 46.019208 11.342857 0.000000 20.602381 49.598413 21.111111 32.715666 40.354334
Ryan Howard 44.191450 41.692857 22.500000 25.576190 38.444048 46.430357 81.944931 87.016138 20.764286 20.602381 0.000000 89.840079 79.769444 40.667532 124.703359
Dwight 123.556432 142.265707 69.378304 196.329949 93.269719 117.605836 238.176138 452.530350 60.986722 49.598413 89.840079 0.000000 65.374206 206.445069 456.741376
Kelly 58.044444 55.544444 39.252020 60.100000 31.594444 46.766703 75.771176 63.588468 24.885714 21.111111 79.769444 65.374206 0.000000 53.457479 57.788593
Andy 125.062612 105.159174 59.419120 84.234423 76.107937 96.470033 130.851199 183.466818 85.220202 32.715666 40.667532 206.445069 53.457479 0.000000 113.068009
Michael 108.404618 82.771664 51.747815 78.417826 77.463557 99.879648 301.972284 293.039036 69.502232 40.354334 124.703359 456.741376 57.788593 113.068009 0.000000

Visualize Scores¶

At this point we'll start comparing the relationships and describing them as 'strong' or 'weak', depending on the value of their scores. It's important to note that a strong relationship in this context doesn't relate to the sentiment involved between the characters, so it won't necessarily be a positive relation.

In this context, a strong relationship means the characters communicate a lot.

By themselves the scores are already very meaningful: we can tell that Pam and Jim have the strongest relationship of all.

We can also notice that Michael, the main character of the show, has an overall higher score with everybody when compared to 'lower-ranked' main characters such as Meredith, Creed, or Darryl.

This makes sense from the perspective that Michael has been communicating more constantly with everybody in the show, so he probably has a stronger relationship with most characters.

Normalize¶

To extract even more information about the relationships we can normalize the scores. In this case we'll do so by standardizing the values, that is, calculating their z-scores. This will allow us to see how many standard deviations away from the mean each relation is.

Simplifying, we want to see how extreme those relationships are for each character.
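A minimal sketch of the standardization for a single character is shown below. It uses Pam's row from the score table (rounded to two decimals) and rests on two assumptions of mine: the character's zero score with herself is left out, and the population standard deviation is used. The notebook may handle either point differently.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Standardize one character's relationship scores."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return {name: 0.0 for name in scores}
    return {name: (value - mu) / sigma for name, value in scores.items()}

# Pam's relationship scores from the table above, rounded, without herself.
pam = {
    "Kevin": 147.98, "Phyllis": 130.94, "Meredith": 76.13, "Angela": 124.94,
    "Stanley": 77.65, "Oscar": 127.76, "Jim": 587.99, "Darryl": 52.90,
    "Creed": 42.35, "Ryan Howard": 81.94, "Dwight": 238.18, "Kelly": 75.77,
    "Andy": 130.85, "Michael": 301.97,
}

pam_z = z_scores(pam)
for name, z in sorted(pam_z.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {z:+.2f}")
```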

Visualize P-Values¶

One way of improving this visualization is by showing the actual p-values, which represent how likely it is to find values at least that extreme in the distribution.

In this case, we'll look for relationships with a p-value lower than 0.05, which corresponds to a 95% confidence level that those relationships differ significantly from the average relationship of the analysed characters.
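One way to turn a z-score into a p-value is sketched below. It assumes the standardized scores follow a normal distribution and uses a one-sided (upper-tail) test, since we are looking for relationships stronger than average. Both choices are assumptions, as are the function names and the example z-scores, which are made up for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def upper_tail_p_value(z):
    """Probability of a z-score at least this large under a standard normal."""
    return 1.0 - STANDARD_NORMAL.cdf(z)

def strong_relations(z_by_relation, alpha=0.05):
    """Keep the relations whose upper-tail p-value falls below alpha."""
    return [
        relation
        for relation, z in z_by_relation.items()
        if upper_tail_p_value(z) < alpha
    ]

# Hypothetical z-scores for three of one character's relationships.
example = {("Pam", "Jim"): 3.1, ("Pam", "Michael"): 1.0, ("Pam", "Creed"): -0.8}
strong = strong_relations(example)
print(strong)
```

Relationship scores are strongly right-skewed, so the normality assumption is a simplification and the resulting p-values are best read as a ranking aid.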

Strongest Relationships¶

With 95% confidence, the relationships listed below had a higher conversation score than the average relationship.

Michael -> Dwight 

Dwight -> Michael
Dwight -> Jim 

Jim -> Dwight 
Jim -> Pam 

Pam -> Jim 

Angela -> Dwight 

Andy -> Dwight

Darryl -> Andy

Ryan -> Michael

Stanley -> Phyllis

Visualize the strongest relationships in a network chart

Words Frequency¶

Word and term frequencies can give us an interesting perspective on how the characters communicate and what the show is about.

Most Frequent Terms¶

To start, we can visualize the show's most frequent words in a word cloud. To do that we're using a bag-of-words algorithm that selects and displays the words and terms with the highest frequency.

We can see in the above visualization that many of the words relate to people: names and pronouns are very common in their daily communications. We can also see that many of those words have little to no meaning by themselves.

To improve on that we can check which terms are distinctive to each character. In other words, we'll remove words that are common to all characters and focus on the words that are specific to each of the main characters.

TF-IDF¶

Term Frequency - Inverse Document Frequency (TF-IDF) is a method that weighs how many times a term appears in a document against how many documents the term appears in.

Term Frequency( t, d ) * Inverse Document Frequency( t )

Term = t
Document = d

We then take the difference between the character's score and the mean score across all characters. This shows how far above or below the average each word was used by that character.
The result is then sorted to get the most above-average words for each character.
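The two steps above can be sketched as follows, treating each character's full text as one document. This uses the plain textbook weighting (relative term frequency times the log of the inverse document frequency); the notebook's exact weighting and normalization may differ. The function names and the token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: {character: list of tokens} -> {character: {term: score}}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

def distinctive_terms(scores, name, top_n=3):
    """Rank a character's terms by how far they sit above the mean score."""
    def mean_score(term):
        return sum(s.get(term, 0.0) for s in scores.values()) / len(scores)
    diffs = {term: value - mean_score(term) for term, value in scores[name].items()}
    return sorted(diffs, key=diffs.get, reverse=True)[:top_n]

# Made-up token lists, one "document" per character.
docs = {
    "Michael": "everybody listen up everybody the meeting".split(),
    "Dwight": "the beets at schrute farms the beets".split(),
    "Jim": "wow the meeting".split(),
}
scores = tf_idf(docs)
print(distinctive_terms(scores, "Michael"))
print(distinctive_terms(scores, "Dwight"))
```

A word used by every character, such as "the" here, gets an inverse document frequency of zero and drops out of the ranking.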

*test*
How many times the words business appears: 2491
last 5 rows:
Out[30]:
word Michael Jim Pam Dwight Phyllis Stanley Oscar Angela Kevin Ryan Howard Kelly Meredith Darryl Creed Andy sum_score mean_score
zoppity 0.001527 0.0 0.0 0.00000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.001527 0.000204
zoran 0.000000 0.0 0.0 0.00169 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.001690 0.000225
zuckerberg 0.000000 0.0 0.0 0.00169 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.001690 0.000225
zuckerberged 0.000000 0.0 0.0 0.00000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.008459 0.0 0.0 0.008459 0.001128
zwarte 0.000000 0.0 0.0 0.00000 0.0 0.0 0.009056 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.009056 0.001207

The 10 most distinct words by character

Out[31]:
Kevin Phyllis Meredith Angela Stanley Oscar Pam Jim Darryl Creed Ryan Howard Dwight Kelly Andy Michael
0 oscar bob wait sprinkles florida gay um awimowheh mike bratton kelly schrute ryan tuna everybody
1 warning clients van senator scarn angela cece alright man debbie wuphf mose god bernard jan
2 shred ho minute dwight mind kevin mural nope val creed presentation jim fashion erin alright
3 awesome bobs jakey cat toaster senator paint wow warehouse boss thailand ha blah cornell holly
4 phillip personality aint kevin damn dollars gosh rundown yall betty pesto assistant ruff bum scott
5 stacy luke lice phillip heh blake chore definitely justine brown um hay bridesmaid flag beep
6 ooh lettuce manuel senators pretzel angelas jims assistant truck swiss silicon sensei cuz andrew sort
7 cress fanned meredith oscar hudson le brian warmer tacos cartwheel treated manager dots treble somebody
8 dunhduhnadah afghani alcoholic contract wallpaper gerald art uh beanie persons powerpoint regional ravi fail david
9 fluke birdhouse vagina ugh lost spend dad beesly jada devon drake idiot obsessed rob carol

Individual Characters¶

One of the many ways of breaking down all this data is by analysing the characters individually. From this point on, the previously discussed methods will be adapted for a single character.

Besides the previously seen data, in this section we'll also explore the ratings.

The options are:
['Kevin' 'Phyllis' 'Meredith' 'Angela' 'Stanley' 'Oscar' 'Pam' 'Jim'
 'Darryl' 'Creed' 'Ryan Howard' 'Dwight' 'Kelly' 'Andy' 'Michael']

Selected character is: Michael

Sentiment Polarity¶

The polarity scores for each dialog were generated by VADER. Please consult the data cleaning and preparation notebook for more information about this method and its implementation.

Normalize Polarity¶

The sentiment analysis displays high amounts of neutral interactions and low amounts of negative and positive ones for most characters. To better visualize the small differences between those scores we can normalize them.

Out[33]:
POS NEU NEG
chars
Kevin 0.667510 -0.672402 -0.101005
Phyllis -0.220853 0.468629 -0.360813
Meredith -0.011946 -0.833780 1.466491
Angela -0.478777 -0.944125 2.490819
Stanley -2.336108 2.060757 0.943092
Oscar -0.510720 1.249539 -1.143053
Pam 1.486826 -1.186604 -0.780542
Jim 1.368037 -0.986721 -0.896094
Darryl -0.131188 0.396196 -0.406808
Creed -0.945611 1.016878 -0.120823
Ryan Howard 0.351370 0.270397 -1.109912
Dwight -0.403143 0.172421 0.467807
Kelly -0.632555 0.631062 0.145183
Andy 0.535616 -0.315455 -0.481661
Michael 1.261541 -1.326792 -0.112680

Radar Charts¶

To visualize the three normalized variables (positive, negative, and neutral), we'll be using radar charts. With the normalized data we can more easily compare the extent of each polarity in the selected character.

Polarity Distribution¶

We can also visualize the distribution of the polarity through the episodes. This should allow us to see changes in the character's behavior and outliers that may be interesting to look at more closely.

Words and Terms¶

In this section, we'll repeat the methods used in '5 - Words Frequency', but this time with a single character. We'll also add a method from spaCy that can help us identify the entities mentioned in the dialogs.

Distinguish Terms¶

Here we can analyze the most distinguishable terms for a specific character. The more distinguishable the term, the bigger its font size.

EVERYBODY

JAN

ALRIGHT

HOLLY

SCOTT

BEEP

SORT

SOMEBODY

DAVID

CAROL

Most Frequent Words¶

Here we're building a word cloud with the most frequent terms the character said. The cleaned version of the text is used for the visualization.

Entities¶

Here we'll visualize the most commonly mentioned entities. More specifically, in this section we'll filter the people, organizations, products, locations, and events mentioned in the dialogs and then count them to find the ones most mentioned by the selected character.

Out[39]:
count
name type
Dwight PERSON 275
Jim PERSON 214
Pam PERSON 148
Ryan PERSON 134
Stanley PERSON 122
Phyllis PERSON 98
Oscar PERSON 95
Kevin PERSON 91
Michael Scott PERSON 71
Jan PERSON 71
Andy PERSON 64
Michael PERSON 62
Scranton ORG 60
David PERSON 46
Holly PERSON 46

Regarding Michael, we can see something common between the word and term frequencies: they're all strongly related to people.

In the TF-IDF scores, Michael's top 10 most distinguishable words include 2 pronouns (everybody and somebody) and 5 names. In the bag-of-words algorithm it's harder to see the patterns since there are many meaningless words, but we can still see lots of names and pronouns related to people.

The strongest evidence of this is the list of entities most frequently mentioned by Michael: of the 15 entities displayed only one is not a person, and this exception is actually the name of their city. This suggests that Michael is someone whose biggest interests are in people and the community.

https://www.youtube.com/watch?v=vrPgsrfZWOU&feature=youtu.be&t=327

Ratings¶

Correlation¶

Here we can verify the correlation (Pearson method) between the previously analysed measures and the actual ratings for the episodes.
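For reference, the Pearson coefficient can be computed by hand as in the sketch below. The per-episode dialog counts and ratings are invented for illustration and are not taken from the dataset; the notebook presumably uses a library routine instead.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)

# Hypothetical per-episode values for one character.
dialogs = [52, 61, 48, 75, 90, 66]
ratings = [8.1, 8.3, 7.9, 8.6, 9.0, 8.2]

r = pearson(dialogs, ratings)
print(f"r = {r:.3f}")
```

The coefficient ranges from -1 to 1 and only captures linear association, so a value near zero does not rule out other kinds of relationship.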

We can also compare any given variable with the actual ratings, which helps us visualize how closely related those values are.

The options are:
['dialogs', 'mean_sent', 'mean_words', 'mean_positive', 'mean_negative', 'mean_neutral', 'mean_compound', 'total_sent', 'total_words']

Selected variable: dialogs